Importing data

Data Exploration

The train dataset has missing values in Age, Cabin and Embarked.
The test dataset has missing values in Age, Cabin and Fare.
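A quick way to check this is `isna().sum()`. The frame below is a tiny stand-in for the real data, which comes from train.csv:

```python
import pandas as pd
import numpy as np

# Tiny stand-in for the Titanic train set (the real data comes from train.csv).
train = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0],
    "Cabin": [np.nan, "C85", np.nan],
    "Embarked": ["S", "C", np.nan],
})

# Count missing values per column.
missing = train.isna().sum()
print(missing)
```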

Running the profile report gives us a glimpse of the distribution of the data, the correlations between columns, and the columns that require conversion.

Encoding Categorical Columns and Filling Missing Values

i. Creating SexId to have female = 0, male = 1
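A minimal sketch of this mapping, using a toy frame in place of the real train set:

```python
import pandas as pd

# Stand-in for the real train set.
train = pd.DataFrame({"Sex": ["male", "female", "female"]})

# Map the Sex column to a numeric SexId: female = 0, male = 1.
train["SexId"] = train["Sex"].map({"female": 0, "male": 1})
```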

ii. Fill missing age

The Age and SibSp columns have high negative correlation. We will use the SibSp column to help us fill the missing Age values.

We will find the median age of each SibSp group and create a map.

This age map will be used to fill the missing Age values in both train and test datasets. Note that this map is entirely built on the train dataset.

We found that this map is missing median values for SibSp > 5, so we reuse the SibSp = 5 median for those groups.
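The steps above can be sketched as follows (the toy frame and the `fill_age` helper are illustrative; the real map is built on the full train set):

```python
import pandas as pd
import numpy as np

# Tiny stand-in for the real train set.
train = pd.DataFrame({
    "SibSp": [0, 0, 1, 1, 5],
    "Age":   [22.0, 30.0, 28.0, np.nan, 40.0],
})

# Median age per SibSp group, computed on the train set only.
age_map = train.groupby("SibSp")["Age"].median().to_dict()

# SibSp > 5 is absent from the map; fall back to the SibSp = 5 median.
fallback = age_map[5]

def fill_age(row):
    """Fill a missing Age from the SibSp median map."""
    if pd.isna(row["Age"]):
        return age_map.get(row["SibSp"], fallback)
    return row["Age"]

# Apply to train; the same map (and helper) is reused on the test set.
train["Age"] = train.apply(fill_age, axis=1)
```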

iii. Encoding Embarked Column

Fill missing values before encoding

Before we encode this column, note that there are a couple of missing values (NaN) in it. We could ignore them and go straight to the encoding step, but then we would have a small problem: the resulting column name has the type float (nan) instead of string, so the column has to be accessed with NumPy's nan -> df[np.nan]

If you don't like that, you can use Pandas' fillna() to fill them first, as we do below.
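A sketch of that fill step, using "UNK" as the placeholder label (toy data standing in for the real column):

```python
import pandas as pd
import numpy as np

# Stand-in for the real Embarked column.
train = pd.DataFrame({"Embarked": ["S", "C", np.nan, "Q"]})

# Replace missing ports with an explicit "UNK" label before encoding,
# so the one-hot column gets a string name instead of float nan.
train["Embarked"] = train["Embarked"].fillna("UNK")
```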

Use sklearn's One Hot Encoder to encode the port names

You can also use Pandas' get_dummies() for this task, but you would miss sklearn's fit/transform interface, which guarantees the test set gets the same columns. -> pd.get_dummies(train.Embarked)

Dropping the unknown column (UNK)

NOTE: dropping a one-hot column doesn't affect the logistic regression results, but it does affect the K-Means results

Create new trimmed datasets

We have filled all missing values in the train dataset, but the test dataset is still missing one Fare value. We will just plug that based on the passenger's Pclass.
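One way to do that, sketched here with toy frames, is to map the missing fare to the median fare of its Pclass (the median choice is an assumption; the map is built on the train set):

```python
import pandas as pd
import numpy as np

# Stand-ins for the cleaned train/test sets; real values come from the CSVs.
train = pd.DataFrame({"Pclass": [1, 1, 3, 3], "Fare": [70.0, 90.0, 7.0, 9.0]})
test = pd.DataFrame({"Pclass": [3, 1], "Fare": [np.nan, 80.0]})

# Median fare per class, computed on train, used to plug the missing test fare.
fare_map = train.groupby("Pclass")["Fare"].median()
test["Fare"] = test["Fare"].fillna(test["Pclass"].map(fare_map))
```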



Further Analyzing the Train Dataset

With that we can see the survival rate of each Pclass

We can see that among the survivors (Survived = 1), the highest proportion came from first class, even though first class passengers were the fewest.
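The per-class survival rate can be computed with a simple groupby (toy numbers below; the full train set has 891 rows):

```python
import pandas as pd

# Toy sample standing in for the full train set.
train = pd.DataFrame({
    "Pclass":   [1, 1, 2, 3, 3, 3],
    "Survived": [1, 1, 0, 0, 1, 0],
})

# Survival rate within each passenger class.
rate = train.groupby("Pclass")["Survived"].mean()
print(rate)
```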

We will do the same for sex.

Here we can also see that most of the women survived and most of the men didn't.

Next we will do something similar for the embarkation ports.

Then we see that port C has a higher survival rate. A reasonable guess is that it is where first class passengers boarded.

With the summary above, we can have several observations:

  1. Survival rate = 342 survived / (342 + 549) ≈ 38.4%
  2. There were more first class passengers than second class passengers. "Age" and "Pclass" have some negative correlation, which is reasonable assuming older people were wealthier and purchased better tickets
  3. More males than females
  4. Average age is a little lower than 30
  5. Two people around 30-40 years old paid exceptionally high fares of > $500
  6. Most people embarked at port S, which has a negative correlation with "Survived", while port C has a slight positive correlation with "Survived". Port "C" is positively related to first class passengers
  7. "Cabin" may be useful for feature engineering and may indicate passengers' locations on the ship, but since it has too many missing values we will not use it
  8. "Survived" has its highest correlations with "Pclass", "Fare" and "Sex"

Data Preprocessing

Supervised Learning : Logistic Regression

We will first standardize the data and then train the Logistic Regression model.
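A minimal sketch of scale-then-fit, on synthetic data standing in for the encoded Titanic features (predict_proba gives the per-passenger probabilities used in the next two steps):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Synthetic feature matrix and labels standing in for the real data.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# Standardize the features, then fit logistic regression.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X, y)

# predict_proba returns one probability per class for each passenger.
proba = model.predict_proba(X)
```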

Try predicting the train dataset and show the probability

Try predicting the test dataset and show the probability

Supervised Learning : SVM

Supervised Learning: Ridge Classification

Supervised Learning: KNN

Supervised Learning: KNN with NCA

Supervised Learning: Tree

Supervised Learning: Random Forest

Supervised Learning: AdaBoost

Supervised Learning: Gradient Boosting

Supervised Learning: Voting Classifier

Cross validation
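A typical cross-validation call looks like this (synthetic data; the real run would pass the prepared Titanic features):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

# Synthetic classification data standing in for the real features.
X, y = make_classification(n_samples=200, random_state=0)

# 5-fold cross-validation: one accuracy score per fold.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
```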

PCA transforming the data
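A sketch of the PCA step on synthetic data, assuming we scale first and keep enough components to explain 95% of the variance (the 0.95 threshold is an assumption, not stated in the text):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic features standing in for the real feature matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 6))

# Scale first, then keep enough components to explain 95% of the variance.
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=0.95)
X_pca = pca.fit_transform(X_scaled)
```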

Outputting the CSV for submission to Kaggle
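Kaggle's Titanic submission expects exactly a PassengerId and a Survived column. A sketch with hypothetical predictions (the real file would be written with `submission.to_csv("submission.csv", index=False)`; here we render to a string instead):

```python
import pandas as pd

# Hypothetical predictions for three test passengers.
submission = pd.DataFrame({
    "PassengerId": [892, 893, 894],
    "Survived":    [0, 1, 0],
})

# Render without the index so the header is exactly "PassengerId,Survived".
csv_text = submission.to_csv(index=False)
print(csv_text)
```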

Results with Kaggle

After submitting three result files to Kaggle (logistic regression, voting classification, and voting classification on PCA-transformed data), we found that although voting classification without PCA yields the best score, all three scores were only slightly different.

So much of the magic in machine learning still lies in feature engineering, and this is what we will continue to explore.